Course Project – Feature Engineering – Phase 1

Professor: Ing. Preng Biba Solares

Teaching assistant: Ing. Jorge Alberto Osoy Barrera

Course: Statistical Learning

Participating students: Jordi Gian Carlo Chajón López (student ID 23000477) and Felipe Carlos Escoto Castro (student ID 23003984).

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
In [ ]:
dataset = pd.read_csv("global-data-on-sustainable-energy.csv")
dataset.head()
Out[ ]:
Entity Year Access to electricity (% of population) Access to clean fuels for cooking Renewable-electricity-generating-capacity-per-capita Financial flows to developing countries (US $) Renewable energy share in the total final energy consumption (%) Electricity from fossil fuels (TWh) Electricity from nuclear (TWh) Electricity from renewables (TWh) ... Primary energy consumption per capita (kWh/person) Energy intensity level of primary energy (MJ/$2017 PPP GDP) Value_co2_emissions_kt_by_country Renewables (% equivalent primary energy) gdp_growth gdp_per_capita Density (P/Km2) Land Area(Km2) Latitude Longitude
0 Afghanistan 2000 1.613591 6.2 9.22 20000.0 44.99 0.16 0.0 0.31 ... 302.59482 1.64 760.000000 NaN NaN NaN 60 652230 33.93911 67.709953
1 Afghanistan 2001 4.074574 7.2 8.86 130000.0 45.60 0.09 0.0 0.50 ... 236.89185 1.74 730.000000 NaN NaN NaN 60 652230 33.93911 67.709953
2 Afghanistan 2002 9.409158 8.2 8.47 3950000.0 37.83 0.13 0.0 0.56 ... 210.86215 1.40 1029.999971 NaN NaN 179.426579 60 652230 33.93911 67.709953
3 Afghanistan 2003 14.738506 9.5 8.09 25970000.0 36.66 0.31 0.0 0.63 ... 229.96822 1.40 1220.000029 NaN 8.832278 190.683814 60 652230 33.93911 67.709953
4 Afghanistan 2004 20.064968 10.9 7.75 NaN 44.24 0.33 0.0 0.56 ... 204.23125 1.20 1029.999971 NaN 1.414118 211.382074 60 652230 33.93911 67.709953

5 rows × 21 columns

Exploratory Analysis of the Original Dataset

In [ ]:
# Show information about the dataset
print(dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3648 entries, 0 to 3647
Data columns (total 21 columns):
 #   Column                                                            Non-Null Count  Dtype  
---  ------                                                            --------------  -----  
 0   Entity                                                            3648 non-null   object 
 1   Year                                                              3648 non-null   int64  
 2   Access to electricity (% of population)                           3648 non-null   float64
 3   Access to clean fuels for cooking                                 3648 non-null   float64
 4   Renewable-electricity-generating-capacity-per-capita              3648 non-null   float64
 5   Financial flows to developing countries (US $)                    3648 non-null   float64
 6   Renewable energy share in the total final energy consumption (%)  3648 non-null   float64
 7   Electricity from fossil fuels (TWh)                               3648 non-null   float64
 8   Electricity from nuclear (TWh)                                    3648 non-null   float64
 9   Electricity from renewables (TWh)                                 3648 non-null   float64
 10  Low-carbon electricity (% electricity)                            3648 non-null   float64
 11  Primary energy consumption per capita (kWh/person)                3648 non-null   float64
 12  Energy intensity level of primary energy (MJ/$2017 PPP GDP)       3648 non-null   float64
 13  Value_co2_emissions_kt_by_country                                 3648 non-null   float64
 14  Renewables (% equivalent primary energy)                          3648 non-null   float64
 15  gdp_growth                                                        3648 non-null   float64
 16  gdp_per_capita                                                    3648 non-null   float64
 17  Density (P/Km2)                                                   3648 non-null   object 
 18  Land Area(Km2)                                                    3648 non-null   int64  
 19  Latitude                                                          3648 non-null   float64
 20  Longitude                                                         3648 non-null   float64
dtypes: float64(17), int64(2), object(2)
memory usage: 598.6+ KB
None
In [ ]:
# Summary statistics for the dataset
print(dataset.describe())
              Year  Access to electricity (% of population)  \
count  3648.000000                              3648.000000   
mean   2010.041118                                78.933693   
std       6.052776                                30.238162   
min    2000.000000                                 1.252269   
25%    2005.000000                                59.916941   
50%    2010.000000                                98.272340   
75%    2015.000000                               100.000000   
max    2020.000000                               100.000000   

       Access to clean fuels for cooking  \
count                        3648.000000   
mean                           63.255504   
std                            38.133777   
min                             0.000000   
25%                            25.862500   
50%                            78.875000   
75%                           100.000000   
max                           100.000000   

       Renewable-electricity-generating-capacity-per-capita  \
count                                        3648.000000      
mean                                           92.493616      
std                                           213.395777      
min                                             0.000000      
25%                                             8.380000      
50%                                            32.880000      
75%                                            67.472500      
max                                          3060.190000      

       Financial flows to developing countries (US $)  \
count                                    3.648000e+03   
mean                                     4.353562e+07   
std                                      1.998022e+08   
min                                      0.000000e+00   
25%                                      5.665000e+06   
50%                                      5.665000e+06   
75%                                      5.665000e+06   
max                                      5.202310e+09   

       Renewable energy share in the total final energy consumption (%)  \
count                                        3648.000000                  
mean                                           32.142379                  
std                                            29.168672                  
min                                             0.000000                  
25%                                             7.095000                  
50%                                            23.270000                  
75%                                            52.612500                  
max                                            96.040000                  

       Electricity from fossil fuels (TWh)  Electricity from nuclear (TWh)  \
count                          3648.000000                     3648.000000   
mean                             69.996209                       12.989315   
std                             347.131749                       71.786293   
min                               0.000000                        0.000000   
25%                               0.300000                        0.000000   
50%                               2.970000                        0.000000   
75%                              26.527500                        0.000000   
max                            5184.130000                      809.410000   

       Electricity from renewables (TWh)  \
count                        3648.000000   
mean                           23.845069   
std                           104.157507   
min                             0.000000   
25%                             0.050000   
50%                             1.470000   
75%                             9.560000   
max                          2184.940000   

       Low-carbon electricity (% electricity)  \
count                             3648.000000   
mean                                36.708904   
std                                 34.129226   
min                                  0.000000   
25%                                  3.030303   
50%                                 27.910000   
75%                                 64.038130   
max                                100.000010   

       Primary energy consumption per capita (kWh/person)  \
count                                        3648.000000    
mean                                        25747.285360    
std                                         34777.415694    
min                                             0.000000    
25%                                          3116.636825    
50%                                         13118.841000    
75%                                         33897.402500    
max                                        262585.700000    

       Energy intensity level of primary energy (MJ/$2017 PPP GDP)  \
count                                        3648.000000             
mean                                            5.250461             
std                                             3.438690             
min                                             0.110000             
25%                                             3.220000             
50%                                             4.300000             
75%                                             5.880000             
max                                            32.570000             

       Value_co2_emissions_kt_by_country  \
count                       3.648000e+03   
mean                        1.423831e+05   
std                         7.285451e+05   
min                         1.000000e+01   
25%                         2.509557e+03   
50%                         1.050000e+04   
75%                         5.136250e+04   
max                         1.070722e+07   

       Renewables (% equivalent primary energy)   gdp_growth  gdp_per_capita  \
count                               3648.000000  3648.000000     3648.000000   
mean                                   8.651135     3.441471    12613.230060   
std                                   10.051457     5.434772    19077.099547   
min                                    0.000000   -62.075920      111.927225   
25%                                    6.290000     1.651476     1464.841885   
50%                                    6.290000     3.440000     4578.630000   
75%                                    6.290000     5.543696    13993.509465   
max                                   86.836586   123.139555   123514.196700   

       Land Area(Km2)     Latitude    Longitude  
count    3.648000e+03  3648.000000  3648.000000  
mean     6.332135e+05    18.246388    14.822695  
std      1.585519e+06    24.159232    66.348148  
min      2.100000e+01   -40.900557  -175.198242  
25%      2.571300e+04     3.202778   -11.779889  
50%      1.176000e+05    17.189877    19.145136  
75%      5.131200e+05    38.969719    46.199616  
max      9.984670e+06    64.963051   178.065032  

1. Determine which columns have missing values (NA or null)

In [ ]:
col_con_nan = []

for col in dataset.columns:
    porcentaje_faltante = dataset[col].isnull().mean()
    if(porcentaje_faltante > 0):
        col_con_nan.append(col)
col_con_nan
Out[ ]:
['Access to electricity (% of population)',
 'Access to clean fuels for cooking',
 'Renewable-electricity-generating-capacity-per-capita',
 'Financial flows to developing countries (US $)',
 'Renewable energy share in the total final energy consumption (%)',
 'Electricity from fossil fuels (TWh)',
 'Electricity from nuclear (TWh)',
 'Electricity from renewables (TWh)',
 'Low-carbon electricity (% electricity)',
 'Energy intensity level of primary energy (MJ/$2017 PPP GDP)',
 'Value_co2_emissions_kt_by_country',
 'Renewables (% equivalent primary energy)',
 'gdp_growth',
 'gdp_per_capita']
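
The same list can be obtained without an explicit loop. The following is a minimal sketch on a small made-up DataFrame (the column names `a`, `b`, `c` and the variable names are hypothetical), using `DataFrame.isnull().any()` to flag columns with at least one missing value:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical); columns "a" and "c" contain missing values
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [1, 2, 3],
    "c": [np.nan, np.nan, 0.5],
})

# Boolean Series indexed by column name: True where any value is missing
has_nan = toy.isnull().any()

# Keep only the flagged column names, in their original order
cols_with_nan = toy.columns[has_nan].tolist()
print(cols_with_nan)
```

Applied to the real data, the same two lines would operate on `dataset` instead of `toy`.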

2. Determine the proportion of missing values for each affected column and show the percentages in a bar chart.

In [ ]:
porcentaje_nulos = dataset[col_con_nan].isnull().mean()
porcentaje_nulos_redondeado = round(porcentaje_nulos * 100, 2)
porcentaje_nulos_redondeado
Out[ ]:
Access to electricity (% of population)                              0.25
Access to clean fuels for cooking                                    4.61
Renewable-electricity-generating-capacity-per-capita                25.52
Financial flows to developing countries (US $)                      57.24
Renewable energy share in the total final energy consumption (%)     5.32
Electricity from fossil fuels (TWh)                                  0.58
Electricity from nuclear (TWh)                                       3.45
Electricity from renewables (TWh)                                    0.58
Low-carbon electricity (% electricity)                               1.15
Energy intensity level of primary energy (MJ/$2017 PPP GDP)          5.65
Value_co2_emissions_kt_by_country                                   11.71
Renewables (% equivalent primary energy)                            58.55
gdp_growth                                                           8.66
gdp_per_capita                                                       7.70
dtype: float64
In [ ]:
fig, ax = plt.subplots(figsize=(8, 5))

# Plot percentages (0-100) so the bar heights match the y-axis label
ax.bar(porcentaje_nulos_redondeado.index, porcentaje_nulos_redondeado, color='skyblue')
ax.set_title('MISSING VALUES BY COLUMN')
ax.set_ylabel('Percentage')
ax.set_xlabel('Columns with missing values')

plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

Several columns have missing values, so we next identify the measurement scale of each one; that is, we classify the variables as categorical, continuous, or discrete.

In [ ]:
categoricas = [col for col in dataset.columns if(dataset[col].dtypes == 'object')]
categoricas
Out[ ]:
['Entity', 'Density (P/Km2)']
In [ ]:
categoricas_con_na = [col for col in categoricas if dataset[col].isnull().mean() > 0]
dataset[categoricas_con_na].isnull().mean()
Out[ ]:
Series([], dtype: float64)
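
`Density (P/Km2)` is listed as categorical only because pandas stored it with `object` dtype, although the values shown by `head()` are numeric. Below is a minimal sketch of coercing such a column to numbers. It assumes (the output above does not confirm this) that the cause is thousands separators such as `1,380`, and the sample values are made up:

```python
import pandas as pd

# Hypothetical sample mimicking a numeric column that was read as strings
density_raw = pd.Series(["60", "1,380", "25"], name="Density (P/Km2)")

# Drop the assumed thousands separator, then convert;
# errors="coerce" turns anything unparseable into NaN instead of raising
density_num = pd.to_numeric(
    density_raw.str.replace(",", "", regex=False), errors="coerce"
)
print(density_num.tolist())
```

If separators are indeed the cause, passing `thousands=","` to `pd.read_csv` would parse the column as numeric at load time.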
In [ ]:
continuas = [col for col in dataset.columns if((dataset[col].dtypes in ['int64', 'float64']) and len(dataset[col].unique()) > 30)]
continuas
Out[ ]:
['Access to electricity (% of population)',
 'Access to clean fuels for cooking',
 'Renewable-electricity-generating-capacity-per-capita',
 'Financial flows to developing countries (US $)',
 'Renewable energy share in the total final energy consumption (%)',
 'Electricity from fossil fuels (TWh)',
 'Electricity from nuclear (TWh)',
 'Electricity from renewables (TWh)',
 'Low-carbon electricity (% electricity)',
 'Primary energy consumption per capita (kWh/person)',
 'Energy intensity level of primary energy (MJ/$2017 PPP GDP)',
 'Value_co2_emissions_kt_by_country',
 'Renewables (% equivalent primary energy)',
 'gdp_growth',
 'gdp_per_capita',
 'Land Area(Km2)',
 'Latitude',
 'Longitude']
In [ ]:
discretas = [col for col in dataset.columns if((dataset[col].dtypes in ['int64', 'float64']) and len(dataset[col].unique()) <= 30)]
discretas
Out[ ]:
['Year']
In [ ]:
discretas_con_na = [col for col in discretas if dataset[col].isnull().mean() > 0]
dataset[discretas_con_na].isnull().mean()
Out[ ]:
Series([], dtype: float64)
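
The list comprehensions above can be folded into a single helper. The following is a sketch rather than the notebook's own code: the function name `classify_columns` and the `max_discrete` parameter are illustrative, non-numeric columns are treated as categorical, and it uses `nunique()`, which ignores NaN, whereas `len(unique())` counts NaN as an extra value.

```python
import numpy as np
import pandas as pd


def classify_columns(df, max_discrete=30):
    """Split column names into categorical, discrete and continuous lists."""
    # Assumption: every non-numeric column is treated as categorical
    categorical = df.select_dtypes(exclude="number").columns.tolist()
    numeric = df.select_dtypes(include="number").columns
    # Numeric columns with few distinct values are treated as discrete
    discrete = [c for c in numeric if df[c].nunique() <= max_discrete]
    continuous = [c for c in numeric if df[c].nunique() > max_discrete]
    return categorical, discrete, continuous


# Toy data (hypothetical): a label, a year-like column and a measurement
toy = pd.DataFrame({
    "entity": ["A", "B"] * 50,
    "year": np.tile(np.arange(2000, 2020), 5),
    "value": np.linspace(0.0, 1.0, 100),
})
categorical, discrete, continuous = classify_columns(toy)
print(categorical, discrete, continuous)
```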

1.1 Imputation of continuous numeric variables

We compute the fraction of missing values in the continuous numeric variables and select those that have any.

In [ ]:
continuas_con_na = [col for col in continuas if dataset[col].isnull().mean() > 0]
dataset[continuas_con_na].isnull().mean()
Out[ ]:
Access to electricity (% of population)                             0.002467
Access to clean fuels for cooking                                   0.046053
Renewable-electricity-generating-capacity-per-capita                0.255208
Financial flows to developing countries (US $)                      0.572368
Renewable energy share in the total final energy consumption (%)    0.053180
Electricity from fossil fuels (TWh)                                 0.005757
Electricity from nuclear (TWh)                                      0.034539
Electricity from renewables (TWh)                                   0.005757
Low-carbon electricity (% electricity)                              0.011513
Energy intensity level of primary energy (MJ/$2017 PPP GDP)         0.056469
Value_co2_emissions_kt_by_country                                   0.117050
Renewables (% equivalent primary energy)                            0.585526
gdp_growth                                                          0.086623
gdp_per_capita                                                      0.077029
dtype: float64
In [ ]:
continuas_con_na = [col for col in continuas if dataset[col].isnull().mean() > 0.06]
dataset[continuas_con_na].isnull().mean()
Out[ ]:
Renewable-electricity-generating-capacity-per-capita    0.255208
Financial flows to developing countries (US $)          0.572368
Value_co2_emissions_kt_by_country                       0.117050
Renewables (% equivalent primary energy)                0.585526
gdp_growth                                              0.086623
gdp_per_capita                                          0.077029
dtype: float64

Since all of the variables listed above have more than 5% missing values, each one requires its own analysis.
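
The per-variable analyses repeat the same comparison, which can be summarized numerically as well as visually. The following is a sketch with a hypothetical helper `compare_imputations` on made-up right-skewed data. Preferring the median for strongly skewed distributions is a common rule of thumb, not necessarily the criterion used in this notebook:

```python
import numpy as np
import pandas as pd


def compare_imputations(series):
    """Return summary statistics before and after mean/median imputation."""
    candidates = {
        "original": series,
        "mean": series.fillna(series.mean()),
        "median": series.fillna(series.median()),
    }
    # One row per variant: centre, spread and asymmetry of the values
    return pd.DataFrame({
        name: {"mean": s.mean(), "median": s.median(),
               "std": s.std(), "skew": s.skew()}
        for name, s in candidates.items()
    }).T


# Toy right-skewed sample (hypothetical) with 20% of the values removed
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=500))
missing_idx = values.sample(frac=0.2, random_state=0).index
values.loc[missing_idx] = np.nan

summary = compare_imputations(values)
print(summary.round(3))
```

In the resulting table, mean imputation leaves the mean unchanged but lowers the standard deviation, because the filled entries add no spread.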

1.1.1 Analysis of variable "Access to electricity (% of population)"
In [ ]:
fig = plt.figure(figsize=(5, 3))
dataset['Access to electricity (% of population)'].hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Access to electricity (% of population)')
plt.show()
Mean and Median Imputation
In [ ]:
mean_Access_to_electricity_of_population = round(dataset['Access to electricity (% of population)'].mean(), 2)

temp_series = dataset['Access to electricity (% of population)'].fillna(mean_Access_to_electricity_of_population)

fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Access to electricity (% of population) - ' + str(mean_Access_to_electricity_of_population))
plt.show()
In [ ]:
median_Access_to_electricity_of_population = round(dataset['Access to electricity (% of population)'].median(), 2)

temp_series = dataset['Access to electricity (% of population)'].fillna(median_Access_to_electricity_of_population)

fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Access to electricity (% of population) - ' + str(median_Access_to_electricity_of_population))
plt.show()

We use mean imputation.

In [ ]:
# Assign the result back rather than using inplace=True on a column selection
dataset['Access to electricity (% of population)'] = dataset['Access to electricity (% of population)'].fillna(mean_Access_to_electricity_of_population)
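
A minimal sketch of this imputation step on made-up data, written as an assignment back to the column (the column name `x` is hypothetical). In recent pandas versions, calling `fillna(..., inplace=True)` on a column selected with `dataset[col]` triggers a chained-assignment warning and may leave the DataFrame unchanged, so assignment is the more robust form:

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical) with one missing value
toy = pd.DataFrame({"x": [10.0, np.nan, 30.0]})

# Same rounding convention as the notebook: mean rounded to 2 decimals
mean_x = round(toy["x"].mean(), 2)

# Assign the filled column back so the DataFrame itself is updated
toy["x"] = toy["x"].fillna(mean_x)
print(toy["x"].tolist())
```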
1.1.2 Analysis of variable "Access to clean fuels for cooking"
In [ ]:
fig = plt.figure(figsize=(5, 3))
dataset['Access to clean fuels for cooking'].hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Access to clean fuels for cooking')
plt.show()
Mean and Median Imputation
In [ ]:
mean_Access_to_clean_fuels_for_cooking = round(dataset['Access to clean fuels for cooking'].mean(), 2)

temp_series = dataset['Access to clean fuels for cooking'].fillna(mean_Access_to_clean_fuels_for_cooking)

fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Access to clean fuels for cooking - ' + str(mean_Access_to_clean_fuels_for_cooking))
plt.show()
In [ ]:
median_Access_to_clean_fuels_for_cooking = round(dataset['Access to clean fuels for cooking'].median(), 2)

temp_series = dataset['Access to clean fuels for cooking'].fillna(median_Access_to_clean_fuels_for_cooking)

fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Access to clean fuels for cooking - ' + str(median_Access_to_clean_fuels_for_cooking))
plt.show()

We use mean imputation.

In [ ]:
# Assign the result back rather than using inplace=True on a column selection
dataset['Access to clean fuels for cooking'] = dataset['Access to clean fuels for cooking'].fillna(mean_Access_to_clean_fuels_for_cooking)
1.1.3 Analysis of variable "Renewable-electricity-generating-capacity-per-capita"
In [ ]:
fig = plt.figure(figsize=(5, 3))
dataset['Renewable-electricity-generating-capacity-per-capita'].hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Renewable-electricity-generating-capacity-per-capita')
plt.show()
Mean and Median Imputation
In [ ]:
mean_Renewable_electricity_generating_capacity_per_capita = round(dataset['Renewable-electricity-generating-capacity-per-capita'].mean(), 2)

temp_series = dataset['Renewable-electricity-generating-capacity-per-capita'].fillna(mean_Renewable_electricity_generating_capacity_per_capita)

fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Renewable-electricity-generating-capacity-per-capita - ' + str(mean_Renewable_electricity_generating_capacity_per_capita))
plt.show()
In [ ]:
median_Renewable_electricity_generating_capacity_per_capita = round(dataset['Renewable-electricity-generating-capacity-per-capita'].median(), 2)

temp_series = dataset['Renewable-electricity-generating-capacity-per-capita'].fillna(median_Renewable_electricity_generating_capacity_per_capita)

fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Renewable-electricity-generating-capacity-per-capita - ' + str(median_Renewable_electricity_generating_capacity_per_capita))
plt.show()

We use median imputation.

In [ ]:
# Assign the result back rather than using inplace=True on a column selection
dataset['Renewable-electricity-generating-capacity-per-capita'] = dataset['Renewable-electricity-generating-capacity-per-capita'].fillna(median_Renewable_electricity_generating_capacity_per_capita)
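
Since every variable goes through the same steps, the chosen strategies could be collected in a mapping and applied in a single loop. The following is a sketch under assumptions: the dictionary `strategies` and the toy DataFrame are hypothetical, with short column names standing in for the real ones.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) standing in for two dataset columns
toy = pd.DataFrame({
    "access": [90.0, np.nan, 100.0, 98.0],
    "capacity": [5.0, 8.0, np.nan, 300.0],
})

# Column -> statistic used to fill its missing values
strategies = {"access": "mean", "capacity": "median"}

fill_values = {}
for col, how in strategies.items():
    # Series.agg accepts the statistic's name, e.g. "mean" or "median"
    fill_values[col] = round(toy[col].agg(how), 2)
    toy[col] = toy[col].fillna(fill_values[col])

print(fill_values)
```

Keeping the fill values in a dictionary also records which value was used for each column, which helps if the same imputation has to be applied to new data later.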
1.1.4 Analysis of variable "Financial flows to developing countries (US $)"
In [ ]:
fig = plt.figure(figsize=(5, 3))
dataset['Financial flows to developing countries (US $)'].hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Financial flows to developing countries (US $)')
plt.show()
Mean and Median Imputation
In [ ]:
mean_Financial_flows_to_developing_countries = round(dataset['Financial flows to developing countries (US $)'].mean(), 2)

temp_series = dataset['Financial flows to developing countries (US $)'].fillna(mean_Financial_flows_to_developing_countries)

fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Financial flows to developing countries (US $) - ' + str(mean_Financial_flows_to_developing_countries))
plt.show()
In [ ]:
median_Financial_flows_to_developing_countries = round(dataset['Financial flows to developing countries (US $)'].median(), 2)

temp_series = dataset['Financial flows to developing countries (US $)'].fillna(median_Financial_flows_to_developing_countries)

fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Financial flows to developing countries (US $) - ' + str(median_Financial_flows_to_developing_countries))
plt.show()

We use median imputation.

In [ ]:
# Assign the result back rather than using inplace=True on a column selection
dataset['Financial flows to developing countries (US $)'] = dataset['Financial flows to developing countries (US $)'].fillna(median_Financial_flows_to_developing_countries)
1.1.5 Analysis of variable "Renewable energy share in the total final energy consumption (%)"
In [ ]:
fig = plt.figure(figsize=(5, 3))
dataset['Renewable energy share in the total final energy consumption (%)'].hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Renewable energy share in the total final energy consumption (%)')
plt.show()
Mean and Median Imputation
In [ ]:
mean_Renewable_energy_share_in_the_total_final_energy_consumption = round(dataset['Renewable energy share in the total final energy consumption (%)'].mean(), 2)

temp_series = dataset['Renewable energy share in the total final energy consumption (%)'].fillna(mean_Renewable_energy_share_in_the_total_final_energy_consumption)

fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Renewable energy share in the total final energy consumption (%) - ' + str(mean_Renewable_energy_share_in_the_total_final_energy_consumption))
plt.show()
In [ ]:
median_Renewable_energy_share_in_the_total_final_energy_consumption = round(dataset['Renewable energy share in the total final energy consumption (%)'].median(), 2)

temp_series = dataset['Renewable energy share in the total final energy consumption (%)'].fillna(median_Renewable_energy_share_in_the_total_final_energy_consumption)

fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Renewable energy share in the total final energy consumption (%) - ' + str(median_Renewable_energy_share_in_the_total_final_energy_consumption))
plt.show()

We use median imputation.

In [ ]:
# Assign the result back rather than using inplace=True on a column selection
dataset['Renewable energy share in the total final energy consumption (%)'] = dataset['Renewable energy share in the total final energy consumption (%)'].fillna(median_Renewable_energy_share_in_the_total_final_energy_consumption)
1.1.6 Analysis of variable "Electricity from fossil fuels (TWh)"
In [ ]:
fig = plt.figure(figsize=(5, 3))
dataset['Electricity from fossil fuels (TWh)'].hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Electricity from fossil fuels (TWh)')
plt.show()
Mean and Median Imputation
In [ ]:
mean_Electricity_from_fossil_fuels = round(dataset['Electricity from fossil fuels (TWh)'].mean(), 2)

temp_series = dataset['Electricity from fossil fuels (TWh)'].fillna(mean_Electricity_from_fossil_fuels)

fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Electricity from fossil fuels (TWh) - ' + str(mean_Electricity_from_fossil_fuels))
plt.show()
In [ ]:
median_Electricity_from_fossil_fuels = round(dataset['Electricity from fossil fuels (TWh)'].median(), 2)

temp_series = dataset['Electricity from fossil fuels (TWh)'].fillna(median_Electricity_from_fossil_fuels)

fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Electricity from fossil fuels (TWh) - ' + str(median_Electricity_from_fossil_fuels))
plt.show()

We use median imputation.

In [ ]:
# Assign the result back rather than using inplace=True on a column selection
dataset['Electricity from fossil fuels (TWh)'] = dataset['Electricity from fossil fuels (TWh)'].fillna(median_Electricity_from_fossil_fuels)
1.1.7 Analysis of variable "Electricity from nuclear (TWh)"
In [ ]:
fig = plt.figure(figsize=(5, 3))
dataset['Electricity from nuclear (TWh)'].hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Electricity from nuclear (TWh)')
plt.show()
Mean and Median Imputation
In [ ]:
mean_Electricity_from_nuclear = round(dataset['Electricity from nuclear (TWh)'].mean(), 2)

temp_series = dataset['Electricity from nuclear (TWh)'].fillna(mean_Electricity_from_nuclear)

fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Electricity from nuclear (TWh) - ' + str(mean_Electricity_from_nuclear))
plt.show()
In [ ]:
median_Electricity_from_nuclear = round(dataset['Electricity from nuclear (TWh)'].median(), 2)

temp_series = dataset['Electricity from nuclear (TWh)'].fillna(median_Electricity_from_nuclear)

fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Electricity from nuclear (TWh) - ' + str(median_Electricity_from_nuclear))
plt.show()

We use median imputation.

In [ ]:
# Assign the result back rather than using inplace=True on a column selection
dataset['Electricity from nuclear (TWh)'] = dataset['Electricity from nuclear (TWh)'].fillna(median_Electricity_from_nuclear)
1.1.8 Analysis of variable "Electricity from renewables (TWh)"
In [ ]:
fig = plt.figure(figsize=(5, 3))
dataset['Electricity from renewables (TWh)'].hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Electricity from renewables (TWh)')
plt.show()
Mean and Median Imputation
In [ ]:
mean_Electricity_from_renewables = round(dataset['Electricity from renewables (TWh)'].mean(), 2)

temp_series = dataset['Electricity from renewables (TWh)'].fillna(mean_Electricity_from_renewables)

fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Electricity from renewables (TWh) - ' + str(mean_Electricity_from_renewables))
plt.show()
In [ ]:
median_Electricity_from_renewables = round(dataset['Electricity from renewables (TWh)'].median(), 2)

temp_series = dataset['Electricity from renewables (TWh)'].fillna(median_Electricity_from_renewables)

fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Electricity from renewables (TWh) - ' + str(median_Electricity_from_renewables))
plt.show()

We use median imputation.

In [ ]:
# Assign the result back rather than using inplace=True on a column selection
dataset['Electricity from renewables (TWh)'] = dataset['Electricity from renewables (TWh)'].fillna(median_Electricity_from_renewables)
1.1.9 Analysis of variable "Low-carbon electricity (% electricity)"
In [ ]:
fig = plt.figure(figsize=(5, 3))
dataset['Low-carbon electricity (% electricity)'].hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Low-carbon electricity (% electricity)')
plt.show()
Mean and Median Imputation
In [ ]:
mean_Low_carbon_electricity = round(dataset['Low-carbon electricity (% electricity)'].mean(), 2)

temp_series = dataset['Low-carbon electricity (% electricity)'].fillna(mean_Low_carbon_electricity)

fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Low-carbon electricity (% electricity) - ' + str(mean_Low_carbon_electricity))
plt.show()
In [ ]:
median_Low_carbon_electricity = round(dataset['Low-carbon electricity (% electricity)'].median(), 2)

temp_series = dataset['Low-carbon electricity (% electricity)'].fillna(median_Low_carbon_electricity)

fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Low-carbon electricity (% electricity) - ' + str(median_Low_carbon_electricity))
plt.show()

We use median imputation.

In [ ]:
# Assign the result back rather than using inplace=True on a column selection
dataset['Low-carbon electricity (% electricity)'] = dataset['Low-carbon electricity (% electricity)'].fillna(median_Low_carbon_electricity)
1.1.10 Analysis of variable "Energy intensity level of primary energy (MJ/$2017 PPP GDP)"
In [ ]:
fig = plt.figure(figsize=(5, 3))
dataset['Energy intensity level of primary energy (MJ/$2017 PPP GDP)'].hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Energy intensity level of primary energy (MJ/$2017 PPP GDP)')
plt.show()
Mean and Median Imputation¶
In [ ]:
mean_Energy_intensity_level_of_primary_energy = round(dataset['Energy intensity level of primary energy (MJ/$2017 PPP GDP)'].mean(), 2)

temp_series = dataset['Energy intensity level of primary energy (MJ/$2017 PPP GDP)'].fillna(mean_Energy_intensity_level_of_primary_energy)

fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Energy intensity level of primary energy (MJ/$2017 PPP GDP) - ' + str(mean_Energy_intensity_level_of_primary_energy))
plt.show()
In [ ]:
median_Energy_intensity_level_of_primary_energy = round(dataset['Energy intensity level of primary energy (MJ/$2017 PPP GDP)'].median(), 2)

temp_series = dataset['Energy intensity level of primary energy (MJ/$2017 PPP GDP)'].fillna(median_Energy_intensity_level_of_primary_energy)

fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Energy intensity level of primary energy (MJ/$2017 PPP GDP) - ' + str(median_Energy_intensity_level_of_primary_energy))
plt.show()

We use median imputation.

In [ ]:
# Assign back instead of using inplace=True on a column selection
dataset['Energy intensity level of primary energy (MJ/$2017 PPP GDP)'] = dataset['Energy intensity level of primary energy (MJ/$2017 PPP GDP)'].fillna(median_Energy_intensity_level_of_primary_energy)
1.1.11 Analysis of Variable "Value_co2_emissions_kt_by_country"¶
In [ ]:
fig = plt.figure(figsize=(5, 3))
dataset['Value_co2_emissions_kt_by_country'].hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Value_co2_emissions_kt_by_country')
plt.show()
Mean and Median Imputation¶
In [ ]:
mean_Value_co2_emissions_kt_by_country = round(dataset['Value_co2_emissions_kt_by_country'].mean(), 2)

temp_series = dataset['Value_co2_emissions_kt_by_country'].fillna(mean_Value_co2_emissions_kt_by_country)

fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Value_co2_emissions_kt_by_country - ' + str(mean_Value_co2_emissions_kt_by_country))
plt.show()
In [ ]:
median_Value_co2_emissions_kt_by_country = round(dataset['Value_co2_emissions_kt_by_country'].median(), 2)

temp_series = dataset['Value_co2_emissions_kt_by_country'].fillna(median_Value_co2_emissions_kt_by_country)

fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Value_co2_emissions_kt_by_country - ' + str(median_Value_co2_emissions_kt_by_country))
plt.show()

We use median imputation.

In [ ]:
# Assign back instead of using inplace=True on a column selection
dataset['Value_co2_emissions_kt_by_country'] = dataset['Value_co2_emissions_kt_by_country'].fillna(median_Value_co2_emissions_kt_by_country)
1.1.12 Analysis of Variable "Renewables (% equivalent primary energy)"¶
In [ ]:
fig = plt.figure(figsize=(5, 3))
dataset['Renewables (% equivalent primary energy)'].hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Renewables (% equivalent primary energy)')
plt.show()
Mean and Median Imputation¶
In [ ]:
mean_Renewables_equivalent_primary_energy = round(dataset['Renewables (% equivalent primary energy)'].mean(), 2)

temp_series = dataset['Renewables (% equivalent primary energy)'].fillna(mean_Renewables_equivalent_primary_energy)

fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Renewables (% equivalent primary energy) - ' + str(mean_Renewables_equivalent_primary_energy))
plt.show()
In [ ]:
median_Renewables_equivalent_primary_energy = round(dataset['Renewables (% equivalent primary energy)'].median(), 2)

temp_series = dataset['Renewables (% equivalent primary energy)'].fillna(median_Renewables_equivalent_primary_energy)

fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Renewables (% equivalent primary energy) - ' + str(median_Renewables_equivalent_primary_energy))
plt.show()

We use median imputation.

In [ ]:
# Assign back instead of using inplace=True on a column selection
dataset['Renewables (% equivalent primary energy)'] = dataset['Renewables (% equivalent primary energy)'].fillna(median_Renewables_equivalent_primary_energy)
1.1.13 Analysis of Variable "gdp_growth"¶
In [ ]:
fig = plt.figure(figsize=(5, 3))
dataset['gdp_growth'].hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('gdp_growth')
plt.show()
In [ ]:
mean_gdp_growth = round(dataset['gdp_growth'].mean(), 2)

temp_series = dataset['gdp_growth'].fillna(mean_gdp_growth)

fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('gdp_growth - ' + str(mean_gdp_growth))
plt.show()
In [ ]:
median_gdp_growth = round(dataset['gdp_growth'].median(), 2)

temp_series = dataset['gdp_growth'].fillna(median_gdp_growth)

fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('gdp_growth - ' + str(median_gdp_growth))
plt.show()

We use mean imputation.

In [ ]:
# Assign back instead of using inplace=True on a column selection
dataset['gdp_growth'] = dataset['gdp_growth'].fillna(mean_gdp_growth)
1.1.14 Analysis of Variable "gdp_per_capita"¶
In [ ]:
fig = plt.figure(figsize=(5, 3))
dataset['gdp_per_capita'].hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('gdp_per_capita')
plt.show()
In [ ]:
mean_gdp_per_capita = round(dataset['gdp_per_capita'].mean(), 2)

temp_series = dataset['gdp_per_capita'].fillna(mean_gdp_per_capita)

fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('gdp_per_capita - ' + str(mean_gdp_per_capita))
plt.show()
In [ ]:
median_gdp_per_capita = round(dataset['gdp_per_capita'].median(), 2)

temp_series = dataset['gdp_per_capita'].fillna(median_gdp_per_capita)

fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('gdp_per_capita - ' + str(median_gdp_per_capita))
plt.show()

We use median imputation.

In [ ]:
# Assign back instead of using inplace=True on a column selection
dataset['gdp_per_capita'] = dataset['gdp_per_capita'].fillna(median_gdp_per_capita)
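
The cells in section 1.1 repeat the same steps for every column. A compact alternative, sketched here under the assumption that the strategy for each column has already been chosen from the histograms, is a small helper. `impute_columns` and the toy frame are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

def impute_columns(df, strategy_by_column):
    """Return a copy of df with NaN filled by each column's rounded mean or median."""
    out = df.copy()
    for col, strategy in strategy_by_column.items():
        if strategy == "mean":
            fill_value = out[col].mean()
        elif strategy == "median":
            fill_value = out[col].median()
        else:
            raise ValueError(f"Unknown strategy: {strategy}")
        # Round to two decimals, mirroring the notebook cells
        out[col] = out[col].fillna(round(fill_value, 2))
    return out

# Toy data (illustrative values, not the project dataset)
toy = pd.DataFrame({
    "gdp_growth": [1.0, 3.0, np.nan, 5.0],
    "gdp_per_capita": [100.0, np.nan, 300.0, 10000.0],
})
imputed = impute_columns(toy, {"gdp_growth": "mean", "gdp_per_capita": "median"})
print(imputed)
```

Returning a copy keeps the original frame available for comparing histograms before and after imputation.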

1.2 - Imputation of Numeric and Categorical Variables with Missing Values¶

No imputation is performed on the discrete and categorical variables, since they have no missing values.

Finally, we check the percentage of missing values in every column again to make sure that all of them have been handled.

In [ ]:
pd.DataFrame(dataset.isnull().mean()).transpose()
Out[ ]:
Entity Year Access to electricity (% of population) Access to clean fuels for cooking Renewable-electricity-generating-capacity-per-capita Financial flows to developing countries (US $) Renewable energy share in the total final energy consumption (%) Electricity from fossil fuels (TWh) Electricity from nuclear (TWh) Electricity from renewables (TWh) ... Primary energy consumption per capita (kWh/person) Energy intensity level of primary energy (MJ/$2017 PPP GDP) Value_co2_emissions_kt_by_country Renewables (% equivalent primary energy) gdp_growth gdp_per_capita Density (P/Km2) Land Area(Km2) Latitude Longitude
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

1 rows × 21 columns
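
Because the displayed table elides the middle columns with `...`, a programmatic check can be easier to trust than reading the output. A minimal sketch, assuming a hypothetical helper `columns_with_missing` and a toy frame:

```python
import numpy as np
import pandas as pd

def columns_with_missing(df):
    """Return the fraction of missing values for columns that still contain NaN."""
    missing = df.isnull().mean()
    return missing[missing > 0]

# Toy data (illustrative only)
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": [1, 2, 3, 4]})
before = columns_with_missing(toy)      # only column "a" is listed

toy["a"] = toy["a"].fillna(toy["a"].median())
after = columns_with_missing(toy)       # empty once every gap is filled
assert after.empty
```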

Writing the Variables File to Disk¶

In [ ]:
dataset.to_csv("fase_1_proy.csv", index=False)

2. Apply outlier treatment to the variables considered continuous¶

End-of-tail capping¶

In [ ]:
dataset_proy =pd.read_csv("fase_1_proy.csv")
dataset_proy.head()
Out[ ]:
Entity Year Access to electricity (% of population) Access to clean fuels for cooking Renewable-electricity-generating-capacity-per-capita Financial flows to developing countries (US $) Renewable energy share in the total final energy consumption (%) Electricity from fossil fuels (TWh) Electricity from nuclear (TWh) Electricity from renewables (TWh) ... Primary energy consumption per capita (kWh/person) Energy intensity level of primary energy (MJ/$2017 PPP GDP) Value_co2_emissions_kt_by_country Renewables (% equivalent primary energy) gdp_growth gdp_per_capita Density (P/Km2) Land Area(Km2) Latitude Longitude
0 Afghanistan 2000 1.613591 6.2 9.22 20000.0 44.99 0.16 0.0 0.31 ... 302.59482 1.64 760.000000 6.29 3.440000 4578.630000 60 652230 33.93911 67.709953
1 Afghanistan 2001 4.074574 7.2 8.86 130000.0 45.60 0.09 0.0 0.50 ... 236.89185 1.74 730.000000 6.29 3.440000 4578.630000 60 652230 33.93911 67.709953
2 Afghanistan 2002 9.409158 8.2 8.47 3950000.0 37.83 0.13 0.0 0.56 ... 210.86215 1.40 1029.999971 6.29 3.440000 179.426579 60 652230 33.93911 67.709953
3 Afghanistan 2003 14.738506 9.5 8.09 25970000.0 36.66 0.31 0.0 0.63 ... 229.96822 1.40 1220.000029 6.29 8.832278 190.683814 60 652230 33.93911 67.709953
4 Afghanistan 2004 20.064968 10.9 7.75 5665000.0 44.24 0.33 0.0 0.56 ... 204.23125 1.20 1029.999971 6.29 1.414118 211.382074 60 652230 33.93911 67.709953

5 rows × 21 columns

In [ ]:
def get_variables_scale(dataset):
    continuas = [col for col in dataset.columns if dataset[col].dtype in ['float64','int64'] and len(dataset[col].unique())>30]
    discretas = [col for col in dataset.columns if dataset[col].dtype in ['float64','int64'] and len(dataset[col].unique())<=30]

    return continuas, discretas
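
The rule used by `get_variables_scale` is a cardinality threshold: a numeric column with more than 30 distinct values is treated as continuous, otherwise as discrete. The sketch below restates that rule with `select_dtypes` and `nunique` on made-up data; `split_by_cardinality` is a hypothetical name. Note that `nunique()` ignores NaN while `unique()` counts it as a value, so the two can differ by one on columns with gaps. A column whose numbers are stored as text is skipped by both lists, which may be why `Density (P/Km2)` does not appear in `cont`.

```python
import pandas as pd

def split_by_cardinality(df, threshold=30):
    """Split numeric columns into continuous and discrete by number of distinct values."""
    n_unique = df.select_dtypes(include="number").nunique()
    continuous = n_unique[n_unique > threshold].index.tolist()
    discrete = n_unique[n_unique <= threshold].index.tolist()
    return continuous, discrete

# Toy data (illustrative only)
toy = pd.DataFrame({
    "value": [i * 0.5 for i in range(40)],       # 40 distinct floats
    "year": [2000 + i % 4 for i in range(40)],   # 4 distinct integers
    "density": [str(i) for i in range(40)],      # numbers stored as text
})
cont_toy, disc_toy = split_by_cardinality(toy)
print(cont_toy, disc_toy)
```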
In [ ]:
cont, disct = get_variables_scale(dataset_proy)
In [ ]:
cont
Out[ ]:
['Access to electricity (% of population)',
 'Access to clean fuels for cooking',
 'Renewable-electricity-generating-capacity-per-capita',
 'Financial flows to developing countries (US $)',
 'Renewable energy share in the total final energy consumption (%)',
 'Electricity from fossil fuels (TWh)',
 'Electricity from nuclear (TWh)',
 'Electricity from renewables (TWh)',
 'Low-carbon electricity (% electricity)',
 'Primary energy consumption per capita (kWh/person)',
 'Energy intensity level of primary energy (MJ/$2017 PPP GDP)',
 'Value_co2_emissions_kt_by_country',
 'Renewables (% equivalent primary energy)',
 'gdp_growth',
 'gdp_per_capita',
 'Land Area(Km2)',
 'Latitude',
 'Longitude']
In [ ]:
# Create a DataFrame holding only the continuous variables
proy_cont = pd.DataFrame(dataset_proy)
cont, disct = get_variables_scale(proy_cont)
df_continuas = proy_cont[cont]
df_continuas.head()
Out[ ]:
Access to electricity (% of population) Access to clean fuels for cooking Renewable-electricity-generating-capacity-per-capita Financial flows to developing countries (US $) Renewable energy share in the total final energy consumption (%) Electricity from fossil fuels (TWh) Electricity from nuclear (TWh) Electricity from renewables (TWh) Low-carbon electricity (% electricity) Primary energy consumption per capita (kWh/person) Energy intensity level of primary energy (MJ/$2017 PPP GDP) Value_co2_emissions_kt_by_country Renewables (% equivalent primary energy) gdp_growth gdp_per_capita Land Area(Km2) Latitude Longitude
0 1.613591 6.2 9.22 20000.0 44.99 0.16 0.0 0.31 65.957440 302.59482 1.64 760.000000 6.29 3.440000 4578.630000 652230 33.93911 67.709953
1 4.074574 7.2 8.86 130000.0 45.60 0.09 0.0 0.50 84.745766 236.89185 1.74 730.000000 6.29 3.440000 4578.630000 652230 33.93911 67.709953
2 9.409158 8.2 8.47 3950000.0 37.83 0.13 0.0 0.56 81.159424 210.86215 1.40 1029.999971 6.29 3.440000 179.426579 652230 33.93911 67.709953
3 14.738506 9.5 8.09 25970000.0 36.66 0.31 0.0 0.63 67.021280 229.96822 1.40 1220.000029 6.29 8.832278 190.683814 652230 33.93911 67.709953
4 20.064968 10.9 7.75 5665000.0 44.24 0.33 0.0 0.56 62.921350 204.23125 1.20 1029.999971 6.29 1.414118 211.382074 652230 33.93911 67.709953
In [ ]:
# Function to plot the outlier diagnostics for one column of the continuous-variable DataFrame
def plot_outliers_analysis(dataset, col):
    plt.figure(figsize=(10,2))
    print(col)
    plt.subplot(131)
    dataset[col].hist(bins=50, density=True, color='red')
    plt.title("Density histogram")
    plt.subplot(132)
    stats.probplot(dataset[col], dist = "norm", plot=plt)
    plt.title("QQ-Plot")
    plt.subplot(133)
    sns.boxplot(y=dataset[col])
    plt.title("Boxplot")
    plt.show()
In [ ]:
for col in cont:
    plot_outliers_analysis(proy_cont, col)
Access to electricity (% of population)
Access to clean fuels for cooking
Renewable-electricity-generating-capacity-per-capita
Financial flows to developing countries (US $)
Renewable energy share in the total final energy consumption (%)
Electricity from fossil fuels (TWh)
Electricity from nuclear (TWh)
Electricity from renewables (TWh)
Low-carbon electricity (% electricity)
Primary energy consumption per capita (kWh/person)
Energy intensity level of primary energy (MJ/$2017 PPP GDP)
Value_co2_emissions_kt_by_country
Renewables (% equivalent primary energy)
gdp_growth
gdp_per_capita
Land Area(Km2)
Latitude
Longitude
In [ ]:
# Function to compute the IQR-based outlier limits for one column
def get_outliers_limits(dataset, col):
    # Use the function's own arguments rather than the globals dataset_proy and col
    q1 = dataset[col].quantile(0.25)
    q3 = dataset[col].quantile(0.75)
    IQR = q3 - q1
    LI = q1 - (1.5 * IQR)
    LS = q3 + (1.5 * IQR)

    return LI, LS
In [ ]:
# Limits for a single column; the values shown correspond to 'Longitude'
get_outliers_limits(df_continuas, 'Longitude')
Out[ ]:
(-98.7491465, 133.1688735)
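
As a worked check of the formula, the limits can be reproduced by hand from the quartiles. The sketch below assumes the 25% and 75% values reported for `Longitude` in the `describe()` output of the scaling section (-11.779889 and 46.199616). With them the limits come out close to the pair printed above, which indicates that the printed pair belongs to `Longitude`.

```python
# Quartiles of Longitude as reported by dataset_proy.describe()
q1, q3 = -11.779889, 46.199616

iqr = q3 - q1              # interquartile range
lower = q1 - 1.5 * iqr     # LI in the notebook
upper = q3 + 1.5 * iqr     # LS in the notebook
print(round(lower, 7), round(upper, 7))
```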
In [ ]:
# Create a new DataFrame to store the variables after outlier capping

capped_df = pd.DataFrame()

for col in df_continuas.columns:
    LI, LS = get_outliers_limits(df_continuas, col)
    capped_df[col] = np.where(df_continuas[col] > LS, LS,
                              np.where(df_continuas[col] < LI, LI,
                                       df_continuas[col]))


capped_df.head()
Out[ ]:
Access to electricity (% of population) Access to clean fuels for cooking Renewable-electricity-generating-capacity-per-capita Financial flows to developing countries (US $) Renewable energy share in the total final energy consumption (%) Electricity from fossil fuels (TWh) Electricity from nuclear (TWh) Electricity from renewables (TWh) Low-carbon electricity (% electricity) Primary energy consumption per capita (kWh/person) Energy intensity level of primary energy (MJ/$2017 PPP GDP) Value_co2_emissions_kt_by_country Renewables (% equivalent primary energy) gdp_growth gdp_per_capita Land Area(Km2) Latitude Longitude
0 1.613591 6.2 9.22 5665000.0 44.99 0.16 0.0 0.31 65.957440 302.59482 1.64 760.000000 6.29 3.440000 4578.630000 652230.0 33.93911 67.709953
1 4.074574 7.2 8.86 5665000.0 45.60 0.09 0.0 0.50 84.745766 236.89185 1.74 730.000000 6.29 3.440000 4578.630000 652230.0 33.93911 67.709953
2 9.409158 8.2 8.47 5665000.0 37.83 0.13 0.0 0.56 81.159424 210.86215 1.40 1029.999971 6.29 3.440000 179.426579 652230.0 33.93911 67.709953
3 14.738506 9.5 8.09 5665000.0 36.66 0.31 0.0 0.63 67.021280 229.96822 1.40 1220.000029 6.29 8.832278 190.683814 652230.0 33.93911 67.709953
4 20.064968 10.9 7.75 5665000.0 44.24 0.33 0.0 0.56 62.921350 204.23125 1.20 1029.999971 6.29 1.414118 211.382074 652230.0 33.93911 67.709953
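
One side effect of IQR capping is visible in the table above: every `Financial flows to developing countries (US $)` value became 5665000.0. The `describe()` output later in the notebook reports that same value for the 25%, 50% and 75% quantiles of the column, so the IQR is zero and both limits collapse onto the median. The sketch below reproduces the effect on a made-up series; `cap_iqr` is a hypothetical helper that uses `Series.clip` in place of the nested `np.where`.

```python
import pandas as pd

def cap_iqr(series, k=1.5):
    """Clip a series to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy series dominated by one repeated value, as happens after heavy median imputation
toy = pd.Series([2.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 9.0])
capped = cap_iqr(toy)
print(capped.tolist())
```

When this happens the capped column carries no information, so it may be worth checking `nunique()` after capping before applying further transformations.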
In [ ]:
get_outliers_limits(capped_df, 'Longitude')
Out[ ]:
(-98.7491465, 133.1688735)
In [ ]:
# Plot every variable of the new DataFrame

for col in cont:
    plot_outliers_analysis(capped_df, col)
Access to electricity (% of population)
Access to clean fuels for cooking
Renewable-electricity-generating-capacity-per-capita
Financial flows to developing countries (US $)
Renewable energy share in the total final energy consumption (%)
Electricity from fossil fuels (TWh)
Electricity from nuclear (TWh)
Electricity from renewables (TWh)
Low-carbon electricity (% electricity)
Primary energy consumption per capita (kWh/person)
Energy intensity level of primary energy (MJ/$2017 PPP GDP)
Value_co2_emissions_kt_by_country
Renewables (% equivalent primary energy)
gdp_growth
gdp_per_capita
Land Area(Km2)
Latitude
Longitude

3. Next, for the variables treated for outliers, check the shape of the distribution and determine whether any variable transformation is needed to improve the shape of the distributions.¶

If so, apply the transformation you consider appropriate in order to bring each variable's probability distribution as close to normal as possible and improve model performance. Remember that you can apply the following transformations:¶

a. Logarithmic,¶

b. Exponential,¶

c. Polynomial,¶

d. Box-Cox,¶

e. Yeo-Johnson¶

Function to plot the histogram and Q-Q plot

In [ ]:
def plot_density_qq(df,variable):

    plt.figure(figsize=(7,4))

    plt.subplot(121)
    df[variable].hist(bins=30)
    plt.title(variable)

    plt.subplot(122)
    stats.probplot(df[variable], dist= 'norm', plot=plt)
    plt.show()
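
The plots produced by `plot_density_qq` are judged visually. As a rough numeric companion, sample skewness can be compared before and after a transformation. This is a sketch on synthetic data, not a criterion used in the notebook: a skewness near zero suggests a roughly symmetric distribution.

```python
import numpy as np
import pandas as pd

# Synthetic data (illustrative only): one symmetric and one right-skewed column
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(size=2000),
    "right_skewed": rng.lognormal(size=2000),
})

# The log of a log-normal variable is normal, so the skew should shrink toward zero
toy["right_skewed_log"] = np.log(toy["right_skewed"])

skew = toy.skew()
print(skew.round(2))
```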

1. For "Access to electricity (% of population)"¶

In [ ]:
col = "Access to electricity (% of population)"
plot_density_qq(capped_df, col)
In [ ]:
# Logarithmic transformation

capped_df[col + '_log'] = np.log(capped_df[col])
plot_density_qq(capped_df, col + '_log')
In [ ]:
# Inverse transformation

capped_df[col + '_inv'] = 1/(capped_df[col])
plot_density_qq(capped_df, col + '_inv')
In [ ]:
# Second-order polynomial transformation

capped_df[col + '_cuand'] = (capped_df[col])**2
plot_density_qq(capped_df, col + '_cuand')
In [ ]:
# Box-Cox transformation

capped_df[col + '_BC'], lmbd = stats.boxcox(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_BC')
1.9652
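`scipy.stats.boxcox` requires strictly positive input, which is consistent with the notebook applying it only to some of the columns. A minimal guard, assuming a hypothetical helper `safe_boxcox` that returns `None` instead of raising when the data contain zeros or negatives:

```python
import numpy as np
from scipy import stats

def safe_boxcox(values):
    """Apply Box-Cox only when every value is strictly positive; otherwise return None."""
    values = np.asarray(values, dtype=float)
    if np.any(values <= 0):
        return None
    transformed, lmbda = stats.boxcox(values)
    return transformed, lmbda

# Toy inputs (illustrative only)
skipped = safe_boxcox([0.0, 1.0, 2.0])             # contains a zero, so it is skipped
result = safe_boxcox([1.0, 2.0, 3.0, 4.0, 10.0, 50.0])
print(skipped, round(result[1], 4))
```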
In [ ]:
# Yeo-Johnson transformation

capped_df[col + '_YJ'], lmbd = stats.yeojohnson(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_YJ')
2.029

2. For "Access to clean fuels for cooking"¶

In [ ]:
col = "Access to clean fuels for cooking"
plot_density_qq(capped_df, col)
In [ ]:
# Second-order polynomial transformation

capped_df[col + '_cuand'] = (capped_df[col])**2
plot_density_qq(capped_df, col + '_cuand')
In [ ]:
# Yeo-Johnson transformation

capped_df[col + '_YJ'], lmbd = stats.yeojohnson(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_YJ')
0.8097

3. For "Renewable-electricity-generating-capacity-per-capita"¶

In [ ]:
col = "Renewable-electricity-generating-capacity-per-capita"
plot_density_qq(capped_df, col)
In [ ]:
# Second-order polynomial transformation

capped_df[col + '_cuand'] = (capped_df[col])**2
plot_density_qq(capped_df, col + '_cuand')
In [ ]:
# Yeo-Johnson transformation

capped_df[col + '_YJ'], lmbd = stats.yeojohnson(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_YJ')
0.2644

4. For "Financial flows to developing countries (US $)"¶

In [ ]:
col = "Financial flows to developing countries (US $)"
plot_density_qq(capped_df, col)
In [ ]:
# Logarithmic transformation

capped_df[col + '_log'] = np.log(capped_df[col])
plot_density_qq(capped_df, col + '_log')
In [ ]:
# Inverse transformation

capped_df[col + '_inv'] = 1/(capped_df[col])
plot_density_qq(capped_df, col + '_inv')
In [ ]:
# Second-order polynomial transformation

capped_df[col + '_cuand'] = (capped_df[col])**2
plot_density_qq(capped_df, col + '_cuand')
In [ ]:
# Yeo-Johnson transformation

capped_df[col + '_YJ'], lmbd = stats.yeojohnson(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_YJ')
0.1967

5. For "Renewable energy share in the total final energy consumption (%)"¶

In [ ]:
col = "Renewable energy share in the total final energy consumption (%)"
plot_density_qq(capped_df, col)
In [ ]:
# Second-order polynomial transformation

capped_df[col + '_cuand'] = (capped_df[col])**2
plot_density_qq(capped_df, col + '_cuand')
In [ ]:
# Yeo-Johnson transformation

capped_df[col + '_YJ'], lmbd = stats.yeojohnson(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_YJ')
0.307

6. For "Electricity from fossil fuels (TWh)"¶

In [ ]:
col = "Electricity from fossil fuels (TWh)"
plot_density_qq(capped_df, col)
In [ ]:
# Second-order polynomial transformation

capped_df[col + '_cuand'] = (capped_df[col])**2
plot_density_qq(capped_df, col + '_cuand')
In [ ]:
# Yeo-Johnson transformation

capped_df[col + '_YJ'], lmbd = stats.yeojohnson(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_YJ')
-0.1922

7. For "Electricity from nuclear (TWh)"¶

In [ ]:
col = "Electricity from nuclear (TWh)"
plot_density_qq(capped_df, col)
In [ ]:
# Second-order polynomial transformation

capped_df[col + '_cuand'] = (capped_df[col])**2
plot_density_qq(capped_df, col + '_cuand')

8. For "Electricity from renewables (TWh)"¶

In [ ]:
col = "Electricity from renewables (TWh)"
plot_density_qq(capped_df, col)
In [ ]:
# Second-order polynomial transformation

capped_df[col + '_cuand'] = (capped_df[col])**2
plot_density_qq(capped_df, col + '_cuand')
In [ ]:
# Yeo-Johnson transformation

capped_df[col + '_YJ'], lmbd = stats.yeojohnson(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_YJ')
-0.3133

9. For "Low-carbon electricity (% electricity)"¶

In [ ]:
col = "Low-carbon electricity (% electricity)"
plot_density_qq(capped_df, col)
In [ ]:
# Second-order polynomial transformation

capped_df[col + '_cuand'] = (capped_df[col])**2
plot_density_qq(capped_df, col + '_cuand')
In [ ]:
# Yeo-Johnson transformation

capped_df[col + '_YJ'], lmbd = stats.yeojohnson(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_YJ')
0.3055

10. For "Energy intensity level of primary energy (MJ/$2017 PPP GDP)"¶

In [ ]:
col = "Energy intensity level of primary energy (MJ/$2017 PPP GDP)"
plot_density_qq(capped_df, col)
In [ ]:
# Logarithmic transformation

capped_df[col + '_log'] = np.log(capped_df[col])
plot_density_qq(capped_df, col + '_log')
In [ ]:
# Inverse transformation

capped_df[col + '_inv'] = 1/(capped_df[col])
plot_density_qq(capped_df, col + '_inv')
In [ ]:
# Second-order polynomial transformation

capped_df[col + '_cuand'] = (capped_df[col])**2
plot_density_qq(capped_df, col + '_cuand')
In [ ]:
# Box-Cox transformation

capped_df[col + '_BC'], lmbd = stats.boxcox(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_BC')
0.3158
In [ ]:
# Yeo-Johnson transformation

capped_df[col + '_YJ'], lmbd = stats.yeojohnson(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_YJ')
0.0172

11. For "Value_co2_emissions_kt_by_country"¶

In [ ]:
col = "Value_co2_emissions_kt_by_country"
plot_density_qq(capped_df, col)
In [ ]:
# Logarithmic transformation

capped_df[col + '_log'] = np.log(capped_df[col])
plot_density_qq(capped_df, col + '_log')
In [ ]:
# Inverse transformation

capped_df[col + '_inv'] = 1/(capped_df[col])
plot_density_qq(capped_df, col + '_inv')
In [ ]:
# Second-order polynomial transformation

capped_df[col + '_cuand'] = (capped_df[col])**2
plot_density_qq(capped_df, col + '_cuand')
In [ ]:
# Box-Cox transformation

capped_df[col + '_BC'], lmbd = stats.boxcox(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_BC')
0.1401
In [ ]:
# Yeo-Johnson transformation

capped_df[col + '_YJ'], lmbd = stats.yeojohnson(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_YJ')
0.1394

12. For "Renewables (% equivalent primary energy)"¶

In [ ]:
col = "Renewables (% equivalent primary energy)"
plot_density_qq(capped_df, col)
In [ ]:
# Logarithmic transformation

capped_df[col + '_log'] = np.log(capped_df[col])
plot_density_qq(capped_df, col + '_log')
In [ ]:
# Inverse transformation

capped_df[col + '_inv'] = 1/(capped_df[col])
plot_density_qq(capped_df, col + '_inv')
In [ ]:
# Second-order polynomial transformation

capped_df[col + '_cuand'] = (capped_df[col])**2
plot_density_qq(capped_df, col + '_cuand')
In [ ]:
# Yeo-Johnson transformation

capped_df[col + '_YJ'], lmbd = stats.yeojohnson(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_YJ')
4.5589

13. For "gdp_growth"¶

In [ ]:
col = "gdp_growth"
plot_density_qq(capped_df, col)
In [ ]:
# Second-order polynomial transformation

capped_df[col + '_cuand'] = (capped_df[col])**2
plot_density_qq(capped_df, col + '_cuand')
In [ ]:
# Yeo-Johnson transformation

capped_df[col + '_YJ'], lmbd = stats.yeojohnson(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_YJ')
1.0638
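`gdp_growth` takes negative values (the `describe()` output reports a minimum of -62.075920), so the logarithm and Box-Cox are not defined for it, while Yeo-Johnson is. The sketch below, on made-up values, shows that `scipy.stats.yeojohnson` accepts zeros and negatives and preserves the ordering of the data.

```python
import numpy as np
from scipy import stats

# Toy values including negatives and zero (illustrative only), sorted ascending
values = np.array([-3.0, -0.5, 0.0, 0.8, 2.5, 4.0, 12.0])

# With no lambda given, SciPy estimates it by maximum likelihood
transformed, lmbda = stats.yeojohnson(values)

# The transformation is monotonic, so the sorted order is kept
print(round(lmbda, 4), np.all(np.diff(transformed) > 0))
```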

14. For "gdp_per_capita"¶

In [ ]:
col = "gdp_per_capita"
plot_density_qq(capped_df, col)
In [ ]:
# Logarithmic transformation

capped_df[col + '_log'] = np.log(capped_df[col])
plot_density_qq(capped_df, col + '_log')
In [ ]:
# Inverse transformation

capped_df[col + '_inv'] = 1/(capped_df[col])
plot_density_qq(capped_df, col + '_inv')
In [ ]:
# Second-order polynomial transformation

capped_df[col + '_cuand'] = (capped_df[col])**2
plot_density_qq(capped_df, col + '_cuand')
In [ ]:
# Box-Cox transformation

capped_df[col + '_BC'], lmbd = stats.boxcox(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_BC')
0.0841
In [ ]:
# Yeo-Johnson transformation

capped_df[col + '_YJ'], lmbd = stats.yeojohnson(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_YJ')
0.0838

5. Finally, once all the transformations described above have been applied, apply feature scaling to the whole dataset. Remember that you can apply the following types of feature scaling.¶

In [ ]:
dataset_proy.describe()
Out[ ]:
Year Access to electricity (% of population) Access to clean fuels for cooking Renewable-electricity-generating-capacity-per-capita Financial flows to developing countries (US $) Renewable energy share in the total final energy consumption (%) Electricity from fossil fuels (TWh) Electricity from nuclear (TWh) Electricity from renewables (TWh) Low-carbon electricity (% electricity) Primary energy consumption per capita (kWh/person) Energy intensity level of primary energy (MJ/$2017 PPP GDP) Value_co2_emissions_kt_by_country Renewables (% equivalent primary energy) gdp_growth gdp_per_capita Land Area(Km2) Latitude Longitude
count 3648.000000 3648.000000 3648.000000 3648.000000 3.648000e+03 3648.000000 3648.000000 3648.000000 3648.000000 3648.000000 3648.000000 3648.000000 3.648000e+03 3648.000000 3648.000000 3648.000000 3.648000e+03 3648.000000 3648.000000
mean 2010.041118 78.933693 63.255504 92.493616 4.353562e+07 32.142379 69.996209 12.989315 23.845069 36.708904 25747.285360 5.250461 1.423831e+05 8.651135 3.441471 12613.230060 6.332135e+05 18.246388 14.822695
std 6.052776 30.238162 38.133777 213.395777 1.998022e+08 29.168672 347.131749 71.786293 104.157507 34.129226 34777.415694 3.438690 7.285451e+05 10.051457 5.434772 19077.099547 1.585519e+06 24.159232 66.348148
min 2000.000000 1.252269 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.110000 1.000000e+01 0.000000 -62.075920 111.927225 2.100000e+01 -40.900557 -175.198242
25% 2005.000000 59.916941 25.862500 8.380000 5.665000e+06 7.095000 0.300000 0.000000 0.050000 3.030303 3116.636825 3.220000 2.509557e+03 6.290000 1.651476 1464.841885 2.571300e+04 3.202778 -11.779889
50% 2010.000000 98.272340 78.875000 32.880000 5.665000e+06 23.270000 2.970000 0.000000 1.470000 27.910000 13118.841000 4.300000 1.050000e+04 6.290000 3.440000 4578.630000 1.176000e+05 17.189877 19.145136
75% 2015.000000 100.000000 100.000000 67.472500 5.665000e+06 52.612500 26.527500 0.000000 9.560000 64.038130 33897.402500 5.880000 5.136250e+04 6.290000 5.543696 13993.509465 5.131200e+05 38.969719 46.199616
max 2020.000000 100.000000 100.000000 3060.190000 5.202310e+09 96.040000 5184.130000 809.410000 2184.940000 100.000010 262585.700000 32.570000 1.070722e+07 86.836586 123.139555 123514.196700 9.984670e+06 64.963051 178.065032
In [ ]:
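# Define a function that applies min-max scaling to every numeric column of our DataFrame

def min_max_scale(df):
    """Return a new DataFrame with each numeric column rescaled to [0, 1]."""
    scaled_df = pd.DataFrame(index=df.index)
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            min_val = df[col].min()
            max_val = df[col].max()
            value_range = max_val - min_val
            # A constant column has zero range; map it to 0.0 instead of dividing by zero
            if value_range == 0:
                scaled_df[col + '_minMaxScaled'] = 0.0
            else:
                scaled_df[col + '_minMaxScaled'] = (df[col] - min_val) / value_range
        else:
            # Non-numeric columns (e.g. Entity) are passed through unchanged
            scaled_df[col] = df[col]
    return scaled_df

# Load the phase 1 dataset from disk
df_continuas = pd.read_csv("fase_1_proy.csv")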
In [ ]:
scaled_df = min_max_scale(df_continuas)

# Show the first rows of the scaled DataFrame
scaled_df.head()
Out[ ]:
Entity Year_minMaxScaled Access to electricity (% of population)_minMaxScaled Access to clean fuels for cooking_minMaxScaled Renewable-electricity-generating-capacity-per-capita_minMaxScaled Financial flows to developing countries (US $)_minMaxScaled Renewable energy share in the total final energy consumption (%)_minMaxScaled Electricity from fossil fuels (TWh)_minMaxScaled Electricity from nuclear (TWh)_minMaxScaled Electricity from renewables (TWh)_minMaxScaled ... Primary energy consumption per capita (kWh/person)_minMaxScaled Energy intensity level of primary energy (MJ/$2017 PPP GDP)_minMaxScaled Value_co2_emissions_kt_by_country_minMaxScaled Renewables (% equivalent primary energy)_minMaxScaled gdp_growth_minMaxScaled gdp_per_capita_minMaxScaled Density (P/Km2) Land Area(Km2)_minMaxScaled Latitude_minMaxScaled Longitude_minMaxScaled
0 Afghanistan 0.00 0.003659 0.062 0.003013 0.000004 0.468451 0.000031 0.0 0.000142 ... 0.001152 0.047135 0.000070 0.072435 0.353728 0.036196 60 0.065321 0.706944 0.687612
1 Afghanistan 0.05 0.028581 0.072 0.002895 0.000025 0.474802 0.000017 0.0 0.000229 ... 0.000902 0.050216 0.000067 0.072435 0.353728 0.036196 60 0.065321 0.706944 0.687612
2 Afghanistan 0.10 0.082603 0.082 0.002768 0.000759 0.393898 0.000025 0.0 0.000256 ... 0.000803 0.039741 0.000095 0.072435 0.353728 0.000547 60 0.065321 0.706944 0.687612
3 Afghanistan 0.15 0.136573 0.095 0.002644 0.004992 0.381716 0.000060 0.0 0.000288 ... 0.000876 0.039741 0.000113 0.072435 0.382842 0.000638 60 0.065321 0.706944 0.687612
4 Afghanistan 0.20 0.190513 0.109 0.002533 0.001089 0.460641 0.000064 0.0 0.000256 ... 0.000778 0.033580 0.000095 0.072435 0.342790 0.000806 60 0.065321 0.706944 0.687612

5 rows × 21 columns
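In the output above, `Density (P/Km2)` keeps its original name and values, which indicates that `min_max_scale` treated it as a non-numeric column. A minimal sketch of one way to handle this follows. It assumes the column is stored as text with thousands separators, which is an assumption since the raw strings are not shown here. The small frame `demo` and its values are hypothetical. The sketch also shows a vectorized form of the same min-max formula.

```python
import pandas as pd

# Hypothetical stand-in for df_continuas (values are illustrative only)
demo = pd.DataFrame({
    "Entity": ["A", "B", "C"],
    "Density (P/Km2)": ["60", "1,265", "105"],  # assumed text format
    "gdp_per_capita": [200.0, 4500.0, 12000.0],
})

# Strip thousands separators, then convert; unparseable entries become NaN
demo["Density (P/Km2)"] = pd.to_numeric(
    demo["Density (P/Km2)"].str.replace(",", "", regex=False),
    errors="coerce",
)

# Vectorized min-max scaling over all numeric columns at once
numeric = demo.select_dtypes(include="number")
scaled = (numeric - numeric.min()) / (numeric.max() - numeric.min())
scaled = scaled.add_suffix("_minMaxScaled")

# Non-numeric columns first, followed by the scaled numeric columns
result = pd.concat([demo.drop(columns=numeric.columns), scaled], axis=1)
print(result)
```

Unlike the loop version, this places the non-numeric columns first instead of preserving the original column order, and it does not guard against constant columns.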